White Wine Exploration by Arthur Lin

1. Data Summary

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

2. Univariate Plots Section

  1. fixed.acidity

The distribution of fixed.acidity looks approximately normal, with some outliers.

  1. volatile.acidity

The distribution of volatile.acidity looks similar to that of fixed.acidity, with a different mean and standard deviation.

  1. citric.acid

The distribution of citric.acid looks like the two above, but with a strange peak around 0.5.

  1. residual.sugar

The distribution of residual.sugar is highly right-skewed (long tail toward high values), so I transformed the x-axis with scale_x_log10; the transformed distribution appears bimodal.
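The effect of the log10 transformation on a right-skewed variable can be sketched as follows. The data here are synthetic (a log-normal sample standing in for residual.sugar), and the helper `skew` is a hypothetical name introduced only for this illustration; the plots in this report used ggplot2's scale_x_log10 instead.

```r
# Synthetic, right-skewed stand-in for residual.sugar (NOT the wine data).
set.seed(42)
sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.9)

# Simple sample skewness: positive means a long right tail.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skew(sugar)          # clearly positive for this sample
skew_log <- skew(log10(sugar))   # close to zero after the transformation

print(c(raw = skew_raw, log10 = skew_log))
```

A mean well above the median, as in the summary of residual.sugar (6.391 vs 5.200), points the same way: the long tail is on the right.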

  1. chlorides

The distribution of chlorides looks approximately normal below 0.1, with many large outliers above that.

  1. free.sulfur.dioxide

The distribution looks much like the one above: approximately normal below 100, with some large outliers.

  1. total.sulfur.dioxide

No surprise, the distribution looks like the above two.

  1. density

The variation is very small, but a closer look at the range between 0.99 and 1.005 shows a distribution that is closer to uniform.

  1. pH

It’s a beautiful normal distribution without any extreme outlier!

  1. sulphates

Basically, it’s a beautiful normal distribution, too.

  1. alcohol

It’s a slightly right-skewed distribution (the mean, 10.51, sits above the median, 10.40) without extreme outliers.

  1. quality

It behaves more like a categorical variable, with only 7 distinct values, so I made a new column called quality_level that labels quality 9 as ‘A’, 8 as ‘B’, and so on down to 3 as ‘G’.

## 
##    G    F    E    D    C    B    A 
##   20  163 1457 2198  880  175    5
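The relabeling described above could be done along these lines. This is only a sketch: it rebuilds a quality vector from the counts printed in the data summary rather than reading the real data, and the use of an ordered factor is an assumption about how quality_level was created.

```r
# Rebuild a quality vector from the counts printed in the data summary
# (scores 3..9); the real analysis would use the quality column directly.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Map 3 -> "G", 4 -> "F", ..., 9 -> "A". The ordered factor is an assumption;
# it keeps the levels sorted from worst to best in tables and plots.
quality_level <- factor(quality,
                        levels = 3:9,
                        labels = c("G", "F", "E", "D", "C", "B", "A"),
                        ordered = TRUE)

print(table(quality_level))
```

Listing the labels from worst to best is what makes the table print in the G to A order shown above.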

3. Univariate Analysis

Some Observations

There is no categorical variable in the original data set. Most of the numerical variables have outliers that push the maximum far above the mean. Only density and pH vary little across records, and the maximum values of alcohol and quality do not lie far from their means.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality. I will explore its correlation with the other features and try to find which combination of features affects quality most.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think alcohol will support my investigation most, since intuitively it is what affects a wine the most.

Since every other variable may affect density, density is another feature that I want to look at more closely.

Did you create any new variables from existing variables in the dataset?

I created quality_level from quality, changing it from a numerical to a categorical value. I think it will be helpful for plotting relationships between other variables with quality as the grouping variable.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed residual.sugar, and the transformed distribution looks bimodal with no single clear peak.

I cut off outliers for most of the variables, and most of them look approximately normal after that. pH, sulphates and quality are the only three variables that needed no further manipulation.

citric.acid has two unusual peaks, around 0.5 and 0.75.

4. Bivariate Plots Section

  1. correlations between each variable
##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

Most pairs of variables are not highly correlated.

total.sulfur.dioxide and free.sulfur.dioxide have a correlation coefficient of 0.62, which is expected because free sulfur dioxide is part of the total.

residual.sugar has a strong positive correlation with density (0.84), and alcohol has a strong negative correlation with density (-0.78).

Not surprisingly, alcohol has the highest correlation coefficient with quality (0.44).

I want to take a closer look at the relationship between quality and the other variables.
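To decide which variables to look at first, the quality column of the correlation matrix can be ranked by absolute value. The sketch below copies the rounded coefficients from the matrix above into a named vector (the vector name `quality_cor` is introduced here only for illustration); with the full data frame the same ranking would come from the quality column of `cor()`.

```r
# Correlation of each variable with quality, copied (rounded) from the matrix above.
quality_cor <- c(fixed.acidity        = -0.1137,
                 volatile.acidity     = -0.1947,
                 citric.acid          = -0.0092,
                 residual.sugar       = -0.0976,
                 chlorides            = -0.2099,
                 free.sulfur.dioxide  =  0.0082,
                 total.sulfur.dioxide = -0.1747,
                 density              = -0.3071,
                 pH                   =  0.0994,
                 sulphates            =  0.0537,
                 alcohol              =  0.4356)

# Sort by strength of the relationship, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(head(ranked, 3))
```

The three strongest entries are alcohol, density and chlorides, which is the order the following subsections take.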

  1. quality & alcohol

At first there appears to be no correlation between them. Quality takes only seven possible values, so the points overplot; I used jitter to make the plot clearer.

Now there is a slight positive relationship between these two variables, though still not a strong one. This is consistent with the correlation coefficient of 0.44 that I got before.

  1. quality & density

The variable with the next highest absolute correlation with quality is density.

The result shows a blurred trend, consistent with the correlation I got before.

  1. alcohol & density

These two variables have a strong negative correlation (-0.78) with each other, and both are related to quality. If I want to investigate which variables affect quality, they may be collinear variables that I need to deal with, so I want to look at their scatter plot.

  1. residual.sugar & density

Although residual.sugar has little relationship with quality, it has a strong relationship with density; in fact, it is the strongest correlation (0.84) in the whole dataset.

  1. alcohol histogram categorized by quality_level

There are only 5 records with quality_level ‘A’ (best) and 20 records with quality_level ‘G’ (worst). The plot shows that most of the level ‘A’ wines have alcohol above 12 and most of the level ‘G’ wines have alcohol below 12. Most wines have alcohol around 10, and only a few of those have quality above ‘C’.

level A:

## [1] 10.4 12.4 12.5 12.7 12.9

level G:

##  [1]  9.8 11.7  8.5 11.5 12.6  9.6  9.1 12.4 11.0  9.1  9.4 11.0  9.7 10.4
## [15] 10.1  8.0 11.0 11.0 10.5 10.5
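The claim about the 12% threshold can be checked directly from the values printed above. This is a small sketch that retypes those two vectors; the names `level_A` and `level_G` are introduced here for illustration.

```r
# Alcohol values for the best (A) and worst (G) levels, copied from the output above.
level_A <- c(10.4, 12.4, 12.5, 12.7, 12.9)
level_G <- c(9.8, 11.7, 8.5, 11.5, 12.6, 9.6, 9.1, 12.4, 11.0, 9.1,
             9.4, 11.0, 9.7, 10.4, 10.1, 8.0, 11.0, 11.0, 10.5, 10.5)

# Share of wines in each level with alcohol above 12.
share_A <- mean(level_A > 12)   # 4 of 5 wines
share_G <- mean(level_G > 12)   # 2 of 20 wines

print(c(A = share_A, G = share_G))
```

With so few records in either level, these shares are only indicative.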
  1. quality & chlorides

The variable with the next highest correlation coefficient with quality is chlorides.

Most of the chlorides data lie under 0.1. I looked at both the part under 0.1 and the outliers above 0.1, but there was not much of a surprise.

  1. free.sulfur.dioxide & total.sulfur.dioxide & quality

The two sulfur-related variables are strongly correlated with each other, but neither of them seems to have much of a relationship with quality.

  1. total.sulfur.dioxide, residual.sugar & alcohol

It’s interesting that alcohol has a moderate negative correlation with total.sulfur.dioxide (-0.45) and residual.sugar (-0.45), and that there is a moderate positive correlation (0.40) between total.sulfur.dioxide and residual.sugar.

It seems that no matter how high the alcohol is, there are always white wines with low residual.sugar. However, as alcohol goes higher, the chance of finding a white wine with high residual.sugar gets lower.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality has no strong relationship with any variable; only alcohol and density show some relationship with it, and not an obvious one. However, alcohol and density are strongly correlated, so perhaps only one of them, which I think is alcohol, is really related to quality. The other variables don’t affect quality much.

There are too few examples of the best and worst levels, only 5 and 20 respectively. More data may be needed to work out the best formula for a level ‘A’ white wine. If I color the points by level, I can only see patterns among the dots for levels ‘B’ to ‘E’.

The density is more dispersed when residual.sugar and alcohol are lower, but the variance is very small in any case. Unsurprisingly, the record with the outlying residual.sugar value (65.8) also has an extreme density value (1.03898).

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782           7.8            0.965         0.6           65.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 2782     0.074                   8                  160 1.03898 3.39
##      sulphates alcohol quality quality_level
## 2782      0.69    11.7       6             D

residual.sugar has a strong relationship with density, and density has some relationship with quality, but residual.sugar seems to have no relationship with quality. My guess is that although residual.sugar changes the density drastically, it is alcohol that changes the quality and also changes the density. So the relationship between density and quality is due to alcohol rather than residual.sugar.
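One way to probe this guess is a first-order partial correlation, which asks how much of the density and quality correlation is left once alcohol is held fixed. The sketch below uses only the three coefficients from the correlation matrix above; the variable names are introduced for illustration, and the formula assumes the relationships are roughly linear.

```r
# Pairwise correlations copied from the correlation matrix above.
r_dq <- -0.30712331   # density vs quality
r_aq <-  0.43557472   # alcohol vs quality
r_da <- -0.78013762   # density vs alcohol

# Partial correlation of density and quality, controlling for alcohol:
# (r_dq - r_da * r_aq) / sqrt((1 - r_da^2) * (1 - r_aq^2))
partial_dq_given_a <- (r_dq - r_da * r_aq) /
  sqrt((1 - r_da^2) * (1 - r_aq^2))

print(round(partial_dq_given_a, 3))
```

The result is close to zero (about 0.06), which is consistent with the idea that the density and quality relationship mostly runs through alcohol. It is a rough check, not proof of causation.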

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Alcohol has some relationship with total.sulfur.dioxide and residual.sugar, but I don’t know the scientific reason for this. Maybe it is caused by the winemaking process.

What was the strongest relationship you found?

The strongest relationship among all the variables is between residual.sugar and density (0.84). The second strongest is the negative relationship between alcohol and density (-0.78).

5. Multivariate Plots Section

  1. alcohol, density and quality

Since alcohol and density are the two variables most strongly related to quality, I want to see the relationship among these three variables.

More level A-C wines lie in the area of low density and high alcohol, but a high alcohol value tends to imply low density, so I may need to discard one of them if I want to use linear regression to predict quality. I think density should be dropped, since its values have only a tiny variance and alcohol intuitively sounds more directly related to the quality of wine.

  1. closer look at alcohol, density and quality

It’s interesting that this trend is not strictly monotonic: level E white wines have the highest average density and the lowest average alcohol. Moreover, I think density is not a good feature for determining the level of a white wine, since there are too few level A samples and lots of outliers among the level B and level C samples. Alcohol also has many outliers in level E and a few in level F, but its trend is otherwise roughly monotonic, which is why it has a 0.44 correlation with quality.

  1. strange peak of citric.acid

There are two strange peaks in the histogram of citric.acid, and I want to see what is going on in these data. The peak around 0.5 is the more prominent one, so I will start with it.

## Warning: position_stack requires non-overlapping x intervals

Using a bar plot, I confirmed that the strange peak is at 0.49; there seems to be no abnormal quality distribution at this point. citric.acid has no strong relationship with any other feature and almost zero correlation (-0.0092) with quality. Its strongest correlation with another feature is 0.29 (with fixed.acidity), so I looked at that.

There is just one outlier of fixed.acidity; nothing else strange shows up.

I created a subset containing only the records with citric.acid equal to 0.49 (there are 215 records) and compared its summary and correlations with those of the original dataset.

Summary of each dataset (first: original; second: subset with citric.acid equal to 0.49):

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##                                                                        
##     alcohol         quality      quality_level
##  Min.   : 8.00   Min.   :3.000   G:  20       
##  1st Qu.: 9.50   1st Qu.:5.000   F: 163       
##  Median :10.40   Median :6.000   E:1457       
##  Mean   :10.51   Mean   :5.878   D:2198       
##  3rd Qu.:11.40   3rd Qu.:6.000   C: 880       
##  Max.   :14.20   Max.   :9.000   B: 175       
##                                  A:   5
##        X        fixed.acidity    volatile.acidity  citric.acid  
##  Min.   : 280   Min.   : 5.600   Min.   :0.0800   Min.   :0.49  
##  1st Qu.:1478   1st Qu.: 6.800   1st Qu.:0.2000   1st Qu.:0.49  
##  Median :1554   Median : 7.400   Median :0.2500   Median :0.49  
##  Mean   :1710   Mean   : 7.489   Mean   :0.2629   Mean   :0.49  
##  3rd Qu.:1626   3rd Qu.: 8.000   3rd Qu.:0.3000   3rd Qu.:0.49  
##  Max.   :4679   Max.   :14.200   Max.   :0.8500   Max.   :0.49  
##                                                                 
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.02700   Min.   : 3.00      
##  1st Qu.: 1.500   1st Qu.:0.03600   1st Qu.:21.50      
##  Median : 5.000   Median :0.04400   Median :32.00      
##  Mean   : 5.793   Mean   :0.04558   Mean   :33.61      
##  3rd Qu.: 8.100   3rd Qu.:0.05100   3rd Qu.:45.00      
##  Max.   :23.500   Max.   :0.23900   Max.   :87.00      
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   : 18.0        Min.   :0.9893   Min.   :2.850   Min.   :0.2700  
##  1st Qu.:113.5        1st Qu.:0.9928   1st Qu.:3.065   1st Qu.:0.3750  
##  Median :138.0        Median :0.9940   Median :3.140   Median :0.4500  
##  Mean   :141.2        Mean   :0.9943   Mean   :3.163   Mean   :0.4623  
##  3rd Qu.:164.0        3rd Qu.:0.9956   3rd Qu.:3.240   3rd Qu.:0.5300  
##  Max.   :247.0        Max.   :1.0024   Max.   :3.650   Max.   :0.9800  
##                                                                        
##     alcohol         quality      quality_level
##  Min.   : 8.50   Min.   :4.000   G:  0        
##  1st Qu.: 9.70   1st Qu.:5.000   F:  9        
##  Median :10.50   Median :6.000   E: 54        
##  Mean   :10.48   Mean   :5.893   D:110        
##  3rd Qu.:11.20   3rd Qu.:6.000   C: 36        
##  Max.   :13.00   Max.   :9.000   B:  5        
##                                  A:  1

Difference in correlation (correlation of each pair of variables in the subset minus that in the original dataset):

##                                X fixed.acidity volatile.acidity
## X                     0.00000000   0.169338319      0.081109140
## fixed.acidity         0.16933832   0.000000000      0.007850737
## volatile.acidity      0.08110914   0.007850737      0.000000000
## citric.acid                   NA            NA               NA
## residual.sugar        0.05076183  -0.102222907      0.198221706
## chlorides             0.04882760  -0.126606540     -0.007668191
## free.sulfur.dioxide   0.05408238  -0.169224582      0.211615334
## total.sulfur.dioxide  0.08679813  -0.269339593      0.129207253
## density               0.20116234  -0.187003123      0.129208969
## pH                   -0.07145433   0.073941212     -0.064242041
## sulphates             0.28740524  -0.136438565     -0.042287170
## alcohol              -0.25512162   0.233466205     -0.002368664
## quality              -0.05382213   0.014865896      0.070077310
##                      citric.acid residual.sugar    chlorides
## X                             NA     0.05076183  0.048827596
## fixed.acidity                 NA    -0.10222291 -0.126606540
## volatile.acidity              NA     0.19822171 -0.007668191
## citric.acid                    0             NA           NA
## residual.sugar                NA     0.00000000 -0.016974896
## chlorides                     NA    -0.01697490  0.000000000
## free.sulfur.dioxide           NA     0.20464880 -0.034777585
## total.sulfur.dioxide          NA     0.11837207 -0.006519063
## density                       NA    -0.01579581 -0.034730974
## pH                            NA     0.12975861  0.136028548
## sulphates                     NA    -0.04865309  0.120901945
## alcohol                       NA     0.13007051  0.108414353
## quality                       NA     0.09646195  0.009786642
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                             0.05408238          0.086798132  0.20116234
## fixed.acidity                -0.16922458         -0.269339593 -0.18700312
## volatile.acidity              0.21161533          0.129207253  0.12920897
## citric.acid                           NA                   NA          NA
## residual.sugar                0.20464880          0.118372067 -0.01579581
## chlorides                    -0.03477759         -0.006519063 -0.03473097
## free.sulfur.dioxide           0.00000000          0.125635140  0.14705069
## total.sulfur.dioxide          0.12563514          0.000000000  0.02359608
## density                       0.14705069          0.023596084  0.00000000
## pH                            0.09835056          0.122556755  0.17865955
## sulphates                    -0.08556566         -0.126602800  0.01771059
## alcohol                       0.02584678          0.124724373  0.09505501
## quality                       0.06932216          0.095331822  0.02175604
##                                pH   sulphates      alcohol      quality
## X                    -0.071454330  0.28740524 -0.255121617 -0.053822126
## fixed.acidity         0.073941212 -0.13643856  0.233466205  0.014865896
## volatile.acidity     -0.064242041 -0.04228717 -0.002368664  0.070077310
## citric.acid                    NA          NA           NA           NA
## residual.sugar        0.129758611 -0.04865309  0.130070509  0.096461950
## chlorides             0.136028548  0.12090194  0.108414353  0.009786642
## free.sulfur.dioxide   0.098350558 -0.08556566  0.025846779  0.069322164
## total.sulfur.dioxide  0.122556755 -0.12660280  0.124724373  0.095331822
## density               0.178659548  0.01771059  0.095055007  0.021756044
## pH                    0.000000000 -0.06671252 -0.113451722 -0.001141212
## sulphates            -0.066712519  0.00000000 -0.124292817 -0.068626260
## alcohol              -0.113451722 -0.12429282  0.000000000  0.044543108
## quality              -0.001141212 -0.06862626  0.044543108  0.000000000

I still can’t find any significant difference in the data shown above. The means, medians and correlations with other variables don’t change in any obvious way in the subset, so I think the peak may have happened by chance, without any deeper reason.
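The comparison above can be sketched as follows, and the sketch also shows why the citric.acid row and column come out as NA: within the subset citric.acid is constant, so its correlation with anything else is undefined. The data frame here is synthetic and much smaller than the wine data; the names `df`, `peak` and `cor_diff` are assumptions made for this illustration.

```r
# Synthetic stand-in for the wine data (NOT the real records).
set.seed(1)
df <- data.frame(citric.acid = round(runif(200, 0.2, 0.6), 2),
                 alcohol     = rnorm(200, mean = 10.5, sd = 1.2),
                 quality     = sample(3:9, 200, replace = TRUE))
df$citric.acid[1:30] <- 0.49   # plant an artificial peak at 0.49

# Keep only the records sitting on the peak.
peak <- subset(df, citric.acid == 0.49)

# cor() warns that the standard deviation is zero for the constant column
# and returns NA for its off-diagonal entries; the warning is silenced here.
cor_diff <- suppressWarnings(cor(peak)) - cor(df)

print(round(cor_diff, 3))
```

Entries far from zero would indicate that the records on the peak behave differently from the rest. In a subset of a couple of hundred rows, differences of 0.1 to 0.2 can also arise from sampling noise alone.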

  1. pH, fixed.acidity and quality

The only feature that seems to have some effect on pH is fixed.acidity, but neither of them affects quality much.

  1. residual.sugar & density & quality

As I mentioned above, residual.sugar has a strong relationship with density and density has some relationship with quality, but residual.sugar seems to have no relationship with quality.

I tried applying a log10 transformation to residual.sugar in the last plot, since its distribution is highly right-skewed, but the plot didn’t show much more insight than the untransformed one.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I took a closer look at the variables that have some relationship with quality. I got a better idea of how the values are distributed, but didn’t find a better formula to predict quality.

Were there any interesting or surprising interactions between features?

The surprising finding is that the boxplot shows the relationship between alcohol and quality_level is not monotonic if we ignore the values that appear to be outliers in each level.

I also looked at another interesting feature, the strange peak in citric.acid, even though it seems to have nothing to do with my feature of interest, quality. However, I couldn’t find any significant difference between the records with 0.49 citric.acid and the others, so I think it just happened by chance.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried to find more relationships between quality and the other features, but I couldn’t find any other feature that seems strongly related to quality, even after log or power transformations. The two features most correlated with quality, alcohol (0.44) and density (-0.31), are correlated with each other, so I need to choose one of them. In the end, the best model I could get has an R-squared value of around 0.2, so I don’t have a great model to show.
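For context on that R-squared value: in a linear regression with a single predictor, R-squared equals the squared Pearson correlation, so alcohol alone would already give about 0.19. The sketch below shows that arithmetic and checks the identity on synthetic data. It does not reproduce the model mentioned above, whose exact predictors are not stated here.

```r
# Correlation between alcohol and quality, taken from this report.
r_alcohol_quality <- 0.4355747
r2_alcohol_only <- r_alcohol_quality^2   # roughly 0.19
print(round(r2_alcohol_only, 3))

# Check the identity R^2 = r^2 on synthetic data (NOT the wine data).
set.seed(7)
x <- rnorm(500)
y <- 0.5 * x + rnorm(500)
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```

An R-squared near 0.2 therefore suggests that any additional predictors contributed little beyond alcohol.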

Final Plots and Summary

Plot One

Description One

Alcohol is the feature with the highest correlation with quality (0.44), so I want to show its relationship with quality in this plot.

correlation coefficient between quality and alcohol

##           alcohol
## quality 0.4355747

outlier numbers of alcohol for each quality_level

## df$quality_level: G
## [1] 0
## -------------------------------------------------------- 
## df$quality_level: F
## [1] 2
## -------------------------------------------------------- 
## df$quality_level: E
## [1] 35
## -------------------------------------------------------- 
## df$quality_level: D
## [1] 0
## -------------------------------------------------------- 
## df$quality_level: C
## [1] 0
## -------------------------------------------------------- 
## df$quality_level: B
## [1] 1
## -------------------------------------------------------- 
## df$quality_level: A
## [1] 1

standard deviation of alcohol values for each quality_level

## df$quality_level: G
## [1] 1.224089
## -------------------------------------------------------- 
## df$quality_level: F
## [1] 1.003217
## -------------------------------------------------------- 
## df$quality_level: E
## [1] 0.8470653
## -------------------------------------------------------- 
## df$quality_level: D
## [1] 1.147776
## -------------------------------------------------------- 
## df$quality_level: C
## [1] 1.246536
## -------------------------------------------------------- 
## df$quality_level: B
## [1] 1.280138
## -------------------------------------------------------- 
## df$quality_level: A
## [1] 1.01341

The outlier counts fit the plot. The standard deviations surprised me at first: I thought level E would have the largest standard deviation since it has the most outliers, but it actually has the smallest (0.85). On second thought this makes sense, since it is the small spread that causes the many outliers. More precisely, geom_boxplot flags points more than 1.5 times the interquartile range beyond the quartiles, so a small difference between Q3 and Q1 makes many samples count as outliers in the boxplot.
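The 1.5 × IQR rule can be sketched in a few lines. The helper `count_outliers` and the two small vectors are hypothetical and invented for this illustration; they are not taken from the wine data. Both vectors contain the same two high values, but only the tightly packed group flags them as outliers.

```r
# Count points beyond 1.5 * IQR from the quartiles (the default boxplot rule).
count_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Two made-up groups that share the values 12.5 and 13.0.
tight <- c(9.2, 9.3, 9.4, 9.5, 9.5, 9.6, 9.7, 9.8, 12.5, 13.0)   # small IQR
wide  <- c(8.5, 9.0, 9.8, 10.4, 11.0, 11.5, 12.0, 12.5, 12.5, 13.0) # large IQR

n_tight <- count_outliers(tight)
n_wide  <- count_outliers(wide)
print(c(tight = n_tight, wide = n_wide))
```

In the tight group the upper fence sits at about 10.3, so both high values are flagged; in the wide group the fence is above 16 and nothing is flagged.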

It also surprised me that the standard deviation of the level A wines is almost as large as that of the other levels, since the data seem to concentrate in a small range in the boxplot. I think this is because there are only five samples, so a single low outlier raises the standard deviation a lot.
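This can be checked with the five level A alcohol values listed in the bivariate section. The sketch below recomputes the standard deviation with and without the single low value (10.4); the variable names are introduced here for illustration.

```r
# Alcohol values of the five level A wines, copied from the bivariate section.
level_A <- c(10.4, 12.4, 12.5, 12.7, 12.9)

sd_all      <- sd(level_A)                    # matches the 1.01341 reported above
sd_no_low   <- sd(level_A[level_A != 10.4])   # spread of the remaining four wines

print(round(c(all = sd_all, without_low = sd_no_low), 3))
```

Dropping that one wine cuts the standard deviation to roughly a fifth of its value, which supports the explanation that a single sample drives the spread.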

Overall, this plot shows that, in general, the better the quality of the wine, the higher its alcohol percentage. This is consistent with the correlation coefficient above.

Plot Two

Description Two

These two features have the strongest correlation (0.84) in the whole dataset, so I want to explore them and see their relationship with quality.

correlation coefficient between density, residual.sugar and quality

##                   density residual.sugar     quality
## density         1.0000000     0.83896645 -0.30712331
## residual.sugar  0.8389665     1.00000000 -0.09757683
## quality        -0.3071233    -0.09757683  1.00000000

mean density value of each quality level

## df$quality_level: G
## [1] 0.994884
## -------------------------------------------------------- 
## df$quality_level: F
## [1] 0.9942767
## -------------------------------------------------------- 
## df$quality_level: E
## [1] 0.9952626
## -------------------------------------------------------- 
## df$quality_level: D
## [1] 0.9939613
## -------------------------------------------------------- 
## df$quality_level: C
## [1] 0.9924524
## -------------------------------------------------------- 
## df$quality_level: B
## [1] 0.9922359
## -------------------------------------------------------- 
## df$quality_level: A
## [1] 0.99146

mean residual.sugar value of each quality level

## df$quality_level: G
## [1] 6.3925
## -------------------------------------------------------- 
## df$quality_level: F
## [1] 4.628221
## -------------------------------------------------------- 
## df$quality_level: E
## [1] 7.334969
## -------------------------------------------------------- 
## df$quality_level: D
## [1] 6.441606
## -------------------------------------------------------- 
## df$quality_level: C
## [1] 5.186477
## -------------------------------------------------------- 
## df$quality_level: B
## [1] 5.671429
## -------------------------------------------------------- 
## df$quality_level: A
## [1] 4.12

The trends in the mean residual.sugar and density values across quality levels agree with the correlation coefficients and the plot.
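One rough way to check this is a rank correlation between the order of the quality levels and the group means printed above. The sketch assumes that the levels G, F, E, D, C, B, A run from lowest to highest quality, which is not stated in this section, and it ignores the very different group sizes.

```r
# Group means copied from the output above, ordered G, F, E, D, C, B, A.
# Assumption: this order runs from the lowest to the highest quality level.
levels_low_to_high <- c("G", "F", "E", "D", "C", "B", "A")
mean_density <- c(0.994884, 0.9942767, 0.9952626, 0.9939613,
                  0.9924524, 0.9922359, 0.99146)
mean_sugar   <- c(6.3925, 4.628221, 7.334969, 6.441606,
                  5.186477, 5.671429, 4.12)
rank_quality <- seq_along(levels_low_to_high)

# Spearman rank correlation between quality order and each set of group means
rho_density <- cor(rank_quality, mean_density, method = "spearman")
rho_sugar   <- cor(rank_quality, mean_sugar,   method = "spearman")
c(density = rho_density, residual.sugar = rho_sugar)
```

Under that assumption the density means fall almost monotonically as quality rises, while the residual.sugar means follow a much weaker and noisier downward pattern.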

The strong relationship between density and residual.sugar is clearly visible in this plot. Another interesting point is that although density has some relationship with quality (horizontal color difference), residual.sugar seems to have none (vertical color difference). I think this is because density is not the feature that ‘truly’ causes the difference in quality; rather, alcohol influences both quality and density. Residual.sugar is simply another feature that strongly affects density but has little effect on quality.

Plot Three

Description Three

Following the conclusion above, I want to take a closer look at the relation between alcohol, density and quality.

correlation coefficient between density, alcohol and quality

##            density    alcohol    quality
## density  1.0000000 -0.7801376 -0.3071233
## alcohol -0.7801376  1.0000000  0.4355747
## quality -0.3071233  0.4355747  1.0000000

summary of density values for each quality_level

## df$quality_level: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## -------------------------------------------------------- 
## df$quality_level: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## -------------------------------------------------------- 
## df$quality_level: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## -------------------------------------------------------- 
## df$quality_level: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## df$quality_level: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## df$quality_level: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## -------------------------------------------------------- 
## df$quality_level: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

standard deviation of density values for each quality_level

## df$quality_level: G
## [1] 0.002830599
## -------------------------------------------------------- 
## df$quality_level: F
## [1] 0.002462357
## -------------------------------------------------------- 
## df$quality_level: E
## [1] 0.002544734
## -------------------------------------------------------- 
## df$quality_level: D
## [1] 0.00302351
## -------------------------------------------------------- 
## df$quality_level: C
## [1] 0.00276767
## -------------------------------------------------------- 
## df$quality_level: B
## [1] 0.002787724
## -------------------------------------------------------- 
## df$quality_level: A
## [1] 0.003118373

The high-quality records are concentrated in the upper-left corner. Meanwhile, the clear linear regression line indicates that the correlation between density and alcohol is high, which means the two variables covary strongly. Moreover, the summary and standard deviation of density at each quality level show that density varies very little across the whole dataset, whatever the quality level. If density had a direct effect on quality, then adding a small amount of high-density material would change the quality of a wine drastically, which does not make sense. It is much more reasonable to believe that alcohol is the main influence on white wine quality and that density is just a covariate that varies slightly with alcohol.
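The argument above can be probed with a first-order partial correlation computed directly from the three pairwise coefficients printed in this section. This is a sketch that was not part of the original analysis: `partial_cor` is a hypothetical helper, and a linear partial correlation is only a rough indication, not proof of causation.

```r
# First-order partial correlation of x and y, controlling for z.
# partial_cor is a hypothetical helper written for this sketch.
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Pairwise correlations copied from the matrix above
r_density_alcohol <- -0.7801376
r_density_quality <- -0.3071233
r_alcohol_quality <-  0.4355747

# density vs quality, holding alcohol fixed
pc_density <- partial_cor(r_density_quality, r_density_alcohol, r_alcohol_quality)
# alcohol vs quality, holding density fixed
pc_alcohol <- partial_cor(r_alcohol_quality, r_density_alcohol, r_density_quality)

round(c(density = pc_density, alcohol = pc_alcohol), 3)
```

With these inputs the density and quality coefficient shrinks to nearly zero once alcohol is held fixed, while the alcohol and quality coefficient stays clearly positive when density is held fixed, which is consistent with the interpretation given here.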

Reflection

In this analysis project, I started by looking at the distribution of each feature and found that most of them are approximately normal once the outliers are removed. With a basic idea of each feature, I then looked at the correlations between all the features, paying most attention to the correlations between quality and the others. Most of the plots did not surprise me much, which is reassuring on the one hand but worrying on the other: reassuring because the plots confirm the numerical results, worrying because I could not find an underlying relationship to use in the prediction model I wanted to build. It is a little frustrating that I did not end up with a good prediction model, but I think this is not unusual in the real world. For further exploration, I should learn more about wine and chemistry, which may help me work out how to transform the data to uncover underlying relationships. I could also try a wider range of learning models; an SVM, for example, might find through its kernel functions relationships that help predict quality and that I did not see here. Finally, collecting more data may be the most straightforward way to improve the model, though this may not be easy.

Reference

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib